{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using PyTorch with TensorRT through ONNX:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TensorRT is a great way to take a trained PyTorch model and optimize it to run more efficiently during inference on an NVIDIA GPU.\n",
    "\n",
    "One approach to convert a PyTorch model to TensorRT is to export a PyTorch model to ONNX (an open format exchange for deep learning models) and then convert into a TensorRT engine. Essentially, we will follow this path to convert and deploy our model:\n",
    "\n",
    "![PyTorch+ONNX](./images/pytorch_onnx.png)\n",
    "\n",
    "Both TensorFlow and PyTorch models can be exported to ONNX, as well as many other frameworks. This allows models created using either framework to flow into common downstream pipelines.\n",
    "\n",
    "To get started, let's take a well-known computer vision model and follow five key steps to deploy it to the TensorRT Python runtime:\n",
    "\n",
    "1. __What format should I save my model in?__\n",
    "2. __What batch size(s) am I running inference at?__\n",
    "3. __What precision am I running inference at?__\n",
    "4. __What TensorRT path am I using to convert my model?__\n",
    "5. __What runtime am I targeting?__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. What format should I save my model in?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use ResNet50, a widely used CNN architecture first described in <a href=https://arxiv.org/abs/1512.03385>this paper</a>.\n",
    "\n",
    "Let's start by loading dependencies and downloading the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchvision.models as models\n",
    "import torch\n",
    "import torch.onnx\n",
    "\n",
    "# load the pretrained model\n",
    "resnet50 = models.resnet50(pretrained=True, progress=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will select our batch size and export the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up a dummy input tensor and export the model to ONNX\n",
    "BATCH_SIZE = 32\n",
    "dummy_input=torch.randn(BATCH_SIZE, 3, 224, 224)\n",
    "torch.onnx.export(resnet50, dummy_input, \"resnet50_pytorch.onnx\", verbose=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we are picking a BATCH_SIZE of 4 in this example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use a benchmarking function included in this guide to time this model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warm up ...\n",
      "Start timing ...\n",
      "Iteration 1000/1000, ave batch time 10.19 ms\n",
      "Input shape: torch.Size([1, 3, 224, 224])\n",
      "Output features size: torch.Size([1, 1000])\n",
      "Average batch time: 10.19 ms\n"
     ]
    }
   ],
   "source": [
    "from benchmark import benchmark\n",
    "\n",
    "resnet50.to(\"cuda\").eval()\n",
    "benchmark(resnet50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's restart our Jupyter Kernel so PyTorch doesn't collide with TensorRT: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os._exit(0) # Shut down all kernels so TRT doesn't fight with PyTorch for GPU memory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. What batch size(s) am I running inference at?\n",
    "\n",
    "We are going to run with a fixed batch size of 4 for this example. Note that above we set BATCH_SIZE to 4 when saving our model to ONNX. We need to create another dummy batch of the same size (this time it will need to be in our target precision) to test out our engine.\n",
    "\n",
    "First, as before, we will set our BATCH_SIZE to 4. Note that our trtexec command above includes the '--explicitBatch' flag to signal to TensorRT that we will be using a fixed batch size at runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "BATCH_SIZE = 32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Importantly, by default TensorRT will use the input precision you give the runtime as the default precision for the rest of the network. So before we create our new dummy batch, we also need to choose a precision as in the next section:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. What precision am I running inference at?\n",
    "\n",
    "Remember that lower precisions than FP32 tend to run faster. There are two common reduced precision modes - FP16 and INT8. Graphics cards that are designed to do inference well often have an affinity for one of these two types. This guide was developed on an NVIDIA V100, which favors FP16, so we will use that here by default. INT8 is a more complicated process that requires a calibration step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "USE_FP16 = True\n",
    "\n",
    "target_dtype = np.float16 if USE_FP16 else np.float32\n",
    "dummy_input_batch = np.zeros((BATCH_SIZE, 224, 224, 3), dtype = np.float32) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. What TensorRT path am I using to convert my model?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use trtexec, a command line tool for working with TensorRT, in order to convert an ONNX model originally from PyTorch to an engine file.\n",
    "\n",
    "Let's make sure we have TensorRT installed (this comes with trtexec):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorrt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To convert the model we saved in the previous step, we need to point to the ONNX file, give trtexec a name to save the engine as, and last specify that we want to use a fixed batch size instead of a dynamic one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "&&&& RUNNING TensorRT.trtexec # trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch --fp16\n",
      "[01/30/2021-02:11:40] [I] === Model Options ===\n",
      "[01/30/2021-02:11:40] [I] Format: ONNX\n",
      "[01/30/2021-02:11:40] [I] Model: resnet50_pytorch.onnx\n",
      "[01/30/2021-02:11:40] [I] Output:\n",
      "[01/30/2021-02:11:40] [I] === Build Options ===\n",
      "[01/30/2021-02:11:40] [I] Max batch: explicit\n",
      "[01/30/2021-02:11:40] [I] Workspace: 16 MiB\n",
      "[01/30/2021-02:11:40] [I] minTiming: 1\n",
      "[01/30/2021-02:11:40] [I] avgTiming: 8\n",
      "[01/30/2021-02:11:40] [I] Precision: FP32+FP16\n",
      "[01/30/2021-02:11:40] [I] Calibration: \n",
      "[01/30/2021-02:11:40] [I] Refit: Disabled\n",
      "[01/30/2021-02:11:40] [I] Safe mode: Disabled\n",
      "[01/30/2021-02:11:40] [I] Save engine: resnet_engine_pytorch.trt\n",
      "[01/30/2021-02:11:40] [I] Load engine: \n",
      "[01/30/2021-02:11:40] [I] Builder Cache: Enabled\n",
      "[01/30/2021-02:11:40] [I] NVTX verbosity: 0\n",
      "[01/30/2021-02:11:40] [I] Tactic sources: Using default tactic sources\n",
      "[01/30/2021-02:11:40] [I] Input(s)s format: fp32:CHW\n",
      "[01/30/2021-02:11:40] [I] Output(s)s format: fp32:CHW\n",
      "[01/30/2021-02:11:40] [I] Input build shapes: model\n",
      "[01/30/2021-02:11:40] [I] Input calibration shapes: model\n",
      "[01/30/2021-02:11:40] [I] === System Options ===\n",
      "[01/30/2021-02:11:40] [I] Device: 0\n",
      "[01/30/2021-02:11:40] [I] DLACore: \n",
      "[01/30/2021-02:11:40] [I] Plugins:\n",
      "[01/30/2021-02:11:40] [I] === Inference Options ===\n",
      "[01/30/2021-02:11:40] [I] Batch: Explicit\n",
      "[01/30/2021-02:11:40] [I] Input inference shapes: model\n",
      "[01/30/2021-02:11:40] [I] Iterations: 10\n",
      "[01/30/2021-02:11:40] [I] Duration: 3s (+ 200ms warm up)\n",
      "[01/30/2021-02:11:40] [I] Sleep time: 0ms\n",
      "[01/30/2021-02:11:40] [I] Streams: 1\n",
      "[01/30/2021-02:11:40] [I] ExposeDMA: Disabled\n",
      "[01/30/2021-02:11:40] [I] Data transfers: Enabled\n",
      "[01/30/2021-02:11:40] [I] Spin-wait: Disabled\n",
      "[01/30/2021-02:11:40] [I] Multithreading: Disabled\n",
      "[01/30/2021-02:11:40] [I] CUDA Graph: Disabled\n",
      "[01/30/2021-02:11:40] [I] Separate profiling: Disabled\n",
      "[01/30/2021-02:11:40] [I] Skip inference: Disabled\n",
      "[01/30/2021-02:11:40] [I] Inputs:\n",
      "[01/30/2021-02:11:40] [I] === Reporting Options ===\n",
      "[01/30/2021-02:11:40] [I] Verbose: Disabled\n",
      "[01/30/2021-02:11:40] [I] Averages: 10 inferences\n",
      "[01/30/2021-02:11:40] [I] Percentile: 99\n",
      "[01/30/2021-02:11:40] [I] Dump refittable layers:Disabled\n",
      "[01/30/2021-02:11:40] [I] Dump output: Disabled\n",
      "[01/30/2021-02:11:40] [I] Profile: Disabled\n",
      "[01/30/2021-02:11:40] [I] Export timing to JSON file: \n",
      "[01/30/2021-02:11:40] [I] Export output to JSON file: \n",
      "[01/30/2021-02:11:40] [I] Export profile to JSON file: \n",
      "[01/30/2021-02:11:40] [I] \n",
      "[01/30/2021-02:11:40] [I] === Device Information ===\n",
      "[01/30/2021-02:11:40] [I] Selected Device: Tesla V100-DGXS-16GB\n",
      "[01/30/2021-02:11:40] [I] Compute Capability: 7.0\n",
      "[01/30/2021-02:11:40] [I] SMs: 80\n",
      "[01/30/2021-02:11:40] [I] Compute Clock Rate: 1.53 GHz\n",
      "[01/30/2021-02:11:40] [I] Device Global Memory: 16155 MiB\n",
      "[01/30/2021-02:11:40] [I] Shared Memory per SM: 96 KiB\n",
      "[01/30/2021-02:11:40] [I] Memory Bus Width: 4096 bits (ECC enabled)\n",
      "[01/30/2021-02:11:40] [I] Memory Clock Rate: 0.877 GHz\n",
      "[01/30/2021-02:11:40] [I] \n",
      "----------------------------------------------------------------\n",
      "Input filename:   resnet50_pytorch.onnx\n",
      "ONNX IR version:  0.0.6\n",
      "Opset version:    9\n",
      "Producer name:    pytorch\n",
      "Producer version: 1.8\n",
      "Domain:           \n",
      "Model version:    0\n",
      "Doc string:       \n",
      "----------------------------------------------------------------\n",
      "[01/30/2021-02:11:57] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.\n",
      "[01/30/2021-02:12:38] [I] [TRT] Detected 1 inputs and 1 output network tensors.\n",
      "[01/30/2021-02:12:38] [I] Engine built in 58.341 sec.\n",
      "[01/30/2021-02:12:39] [I] Starting inference\n",
      "[01/30/2021-02:12:42] [I] Warmup completed 0 queries over 200 ms\n",
      "[01/30/2021-02:12:42] [I] Timing trace has 0 queries over 3.01474 s\n",
      "[01/30/2021-02:12:42] [I] Trace averages of 10 runs:\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.54261 ms - Host latency: 7.12485 ms (end to end 11.0132 ms, enqueue 0.490388 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.54179 ms - Host latency: 7.13058 ms (end to end 10.3497 ms, enqueue 0.473805 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.49775 ms - Host latency: 7.08051 ms (end to end 10.7457 ms, enqueue 0.473593 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.48351 ms - Host latency: 7.0662 ms (end to end 10.6246 ms, enqueue 0.477765 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.48168 ms - Host latency: 7.07059 ms (end to end 10.6304 ms, enqueue 0.511209 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.46939 ms - Host latency: 7.05303 ms (end to end 10.8701 ms, enqueue 0.476138 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.46038 ms - Host latency: 7.04825 ms (end to end 10.844 ms, enqueue 0.510028 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45701 ms - Host latency: 7.04156 ms (end to end 10.4863 ms, enqueue 0.481628 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.46315 ms - Host latency: 7.04678 ms (end to end 10.8546 ms, enqueue 0.486493 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45135 ms - Host latency: 7.03293 ms (end to end 10.8368 ms, enqueue 0.451886 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45332 ms - Host latency: 7.03381 ms (end to end 10.4568 ms, enqueue 0.448792 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.47145 ms - Host latency: 7.05214 ms (end to end 10.8879 ms, enqueue 0.450238 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.46673 ms - Host latency: 7.0481 ms (end to end 10.8669 ms, enqueue 0.458594 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.4571 ms - Host latency: 7.03876 ms (end to end 10.8491 ms, enqueue 0.45224 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44923 ms - Host latency: 7.03019 ms (end to end 10.8304 ms, enqueue 0.456989 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45555 ms - Host latency: 7.04651 ms (end to end 10.8335 ms, enqueue 0.452844 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.43733 ms - Host latency: 7.13279 ms (end to end 10.8064 ms, enqueue 0.456653 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.4527 ms - Host latency: 7.19491 ms (end to end 10.8319 ms, enqueue 0.471423 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45814 ms - Host latency: 7.21376 ms (end to end 10.8433 ms, enqueue 0.478052 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45199 ms - Host latency: 7.20497 ms (end to end 10.8342 ms, enqueue 0.491235 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45476 ms - Host latency: 7.20305 ms (end to end 10.835 ms, enqueue 0.481665 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45189 ms - Host latency: 7.20339 ms (end to end 10.8307 ms, enqueue 0.483386 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45106 ms - Host latency: 7.19846 ms (end to end 10.8319 ms, enqueue 0.509949 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.43886 ms - Host latency: 7.18525 ms (end to end 10.7492 ms, enqueue 0.416125 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.43108 ms - Host latency: 7.16941 ms (end to end 10.8011 ms, enqueue 0.404297 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44432 ms - Host latency: 7.19043 ms (end to end 10.82 ms, enqueue 0.413928 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44503 ms - Host latency: 7.19503 ms (end to end 10.8254 ms, enqueue 0.411731 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45649 ms - Host latency: 7.20958 ms (end to end 10.8403 ms, enqueue 0.412561 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44401 ms - Host latency: 7.19244 ms (end to end 10.8213 ms, enqueue 0.412598 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44102 ms - Host latency: 7.18861 ms (end to end 10.8133 ms, enqueue 0.436987 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45006 ms - Host latency: 7.20289 ms (end to end 10.833 ms, enqueue 0.439331 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.4575 ms - Host latency: 7.20284 ms (end to end 10.8427 ms, enqueue 0.436646 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45487 ms - Host latency: 7.20845 ms (end to end 10.8413 ms, enqueue 0.439319 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.4395 ms - Host latency: 7.1884 ms (end to end 10.8078 ms, enqueue 0.43324 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.48342 ms - Host latency: 7.23835 ms (end to end 10.8946 ms, enqueue 0.439404 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45503 ms - Host latency: 7.19844 ms (end to end 10.8375 ms, enqueue 0.440918 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.43992 ms - Host latency: 7.1866 ms (end to end 10.8096 ms, enqueue 0.442407 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45095 ms - Host latency: 7.19233 ms (end to end 10.8367 ms, enqueue 0.435132 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.4522 ms - Host latency: 7.20203 ms (end to end 10.8335 ms, enqueue 0.436938 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44805 ms - Host latency: 7.19414 ms (end to end 10.8289 ms, enqueue 0.434473 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44307 ms - Host latency: 7.19053 ms (end to end 10.8143 ms, enqueue 0.439136 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45256 ms - Host latency: 7.20137 ms (end to end 10.8323 ms, enqueue 0.440991 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45581 ms - Host latency: 7.20334 ms (end to end 10.8416 ms, enqueue 0.464575 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44614 ms - Host latency: 7.19006 ms (end to end 10.8206 ms, enqueue 0.4677 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45139 ms - Host latency: 7.20149 ms (end to end 10.8334 ms, enqueue 0.46626 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44536 ms - Host latency: 7.19265 ms (end to end 10.817 ms, enqueue 0.465039 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44517 ms - Host latency: 7.19331 ms (end to end 10.8181 ms, enqueue 0.470776 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45247 ms - Host latency: 7.20457 ms (end to end 10.8336 ms, enqueue 0.47019 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45178 ms - Host latency: 7.20559 ms (end to end 10.83 ms, enqueue 0.464844 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44983 ms - Host latency: 7.19641 ms (end to end 10.827 ms, enqueue 0.470581 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.45164 ms - Host latency: 7.19951 ms (end to end 10.834 ms, enqueue 0.469531 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.4571 ms - Host latency: 7.20801 ms (end to end 10.8466 ms, enqueue 0.462158 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44883 ms - Host latency: 7.20139 ms (end to end 10.8204 ms, enqueue 0.473437 ms)\n",
      "[01/30/2021-02:12:42] [I] Average on 10 runs - GPU latency: 5.44988 ms - Host latency: 7.19773 ms (end to end 10.8268 ms, enqueue 0.466772 ms)\n",
      "[01/30/2021-02:12:42] [I] Host Latency\n",
      "[01/30/2021-02:12:42] [I] min: 6.96747 ms (end to end 7.06885 ms)\n",
      "[01/30/2021-02:12:42] [I] max: 7.6875 ms (end to end 11.3145 ms)\n",
      "[01/30/2021-02:12:42] [I] mean: 7.15645 ms (end to end 10.8042 ms)\n",
      "[01/30/2021-02:12:42] [I] median: 7.18567 ms (end to end 10.8351 ms)\n",
      "[01/30/2021-02:12:42] [I] percentile: 7.24304 ms at 99% (end to end 11.0522 ms at 99%)\n",
      "[01/30/2021-02:12:42] [I] throughput: 0 qps\n",
      "[01/30/2021-02:12:42] [I] walltime: 3.01474 s\n",
      "[01/30/2021-02:12:42] [I] Enqueue Time\n",
      "[01/30/2021-02:12:42] [I] min: 0.349854 ms\n",
      "[01/30/2021-02:12:42] [I] max: 0.832886 ms\n",
      "[01/30/2021-02:12:42] [I] median: 0.460693 ms\n",
      "[01/30/2021-02:12:42] [I] GPU Compute\n",
      "[01/30/2021-02:12:42] [I] min: 5.38318 ms\n",
      "[01/30/2021-02:12:42] [I] max: 5.94751 ms\n",
      "[01/30/2021-02:12:42] [I] mean: 5.45704 ms\n",
      "[01/30/2021-02:12:42] [I] median: 5.45483 ms\n",
      "[01/30/2021-02:12:42] [I] percentile: 5.5603 ms at 99%\n",
      "[01/30/2021-02:12:42] [I] total compute time: 2.97955 s\n",
      "&&&& PASSED TensorRT.trtexec # trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt --explicitBatch --fp16\n"
     ]
    }
   ],
   "source": [
    "# step out of Python for a moment to convert the ONNX model to a TRT engine using trtexec\n",
    "if USE_FP16:\n",
    "    !trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt  --explicitBatch --fp16\n",
    "else:\n",
    "    !trtexec --onnx=resnet50_pytorch.onnx --saveEngine=resnet_engine_pytorch.trt  --explicitBatch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will save our model as 'resnet_engine.trt'."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. What TensorRT runtime am I targeting?\n",
    "\n",
    "Now, we have a converted our model to a TensorRT engine. Great! That means we are ready to load it into the native Python TensorRT runtime. This runtime strikes a balance between the ease of use of the high level Python APIs used in frameworks and the fast, low level C++ runtimes available in TensorRT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorrt as trt\n",
    "import pycuda.driver as cuda\n",
    "import pycuda.autoinit\n",
    "\n",
    "f = open(\"resnet_engine_pytorch.trt\", \"rb\")\n",
    "runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING)) \n",
    "\n",
    "engine = runtime.deserialize_cuda_engine(f.read())\n",
    "context = engine.create_execution_context()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now allocate input and output memory, give TRT pointers (bindings) to it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# need to set input and output precisions to FP16 to fully enable it\n",
    "output = np.empty([BATCH_SIZE, 1000], dtype = target_dtype) \n",
    "\n",
    "# allocate device memory\n",
    "d_input = cuda.mem_alloc(1 * dummy_input_batch.nbytes)\n",
    "d_output = cuda.mem_alloc(1 * output.nbytes)\n",
    "\n",
    "bindings = [int(d_input), int(d_output)]\n",
    "\n",
    "stream = cuda.Stream()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, set up the prediction function.\n",
    "\n",
    "This involves a copy from CPU RAM to GPU VRAM, executing the model, then copying the results back from GPU VRAM to CPU RAM:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def predict(batch): # result gets copied into output\n",
    "    # transfer input data to device\n",
    "    cuda.memcpy_htod_async(d_input, batch, stream)\n",
    "    # execute model\n",
    "    context.execute_async_v2(bindings, stream.handle, None)\n",
    "    # transfer predictions back\n",
    "    cuda.memcpy_dtoh_async(output, d_output, stream)\n",
    "    # syncronize threads\n",
    "    stream.synchronize()\n",
    "    \n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's time the function!\n",
    "\n",
    "Note that we're going to include the extra CPU-GPU copy time in this evaluation, so it won't be directly comparable with our TRTorch model performance as it also includes additional overhead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warming up...\n",
      "Done warming up!\n"
     ]
    }
   ],
   "source": [
    "print(\"Warming up...\")\n",
    "\n",
    "predict(dummy_input_batch)\n",
    "\n",
    "print(\"Done warming up!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.15 ms ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "\n",
    "pred = predict(dummy_input_batch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, even with the CPU-GPU copy, this is still faster than our raw PyTorch model!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next Steps:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h4> Profiling </h4>\n",
    "\n",
    "This is a great next step for further optimizing and debugging models you are working on productionizing\n",
    "\n",
    "You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html\n",
    "\n",
    "<h4>  TRT Dev Docs </h4>\n",
    "\n",
    "Main documentation page for the ONNX, layer builder, C++, and legacy APIs\n",
    "\n",
    "You can find it here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html\n",
    "\n",
    "<h4>  TRT OSS GitHub </h4>\n",
    "\n",
    "Contains OSS TRT components, sample applications, and plugin examples\n",
    "\n",
    "You can find it here: https://github.com/NVIDIA/TensorRT\n",
    "\n",
    "\n",
    "#### TRT Supported Layers:\n",
    "\n",
    "https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/samplePlugin\n",
    "\n",
    "#### TRT ONNX Plugin Example:\n",
    "\n",
    "https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#layers-precision-matrix"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
